Abstract


While automated essay scoring (AES) can re- 
liably grade essays at scale, automated writing 
evaluation (AWE) additionally provides forma- 
tive feedback to guide essay revision. How- 
ever, a neural AES typically does not provide 
useful feature representations for supporting 
AWE. This paper presents a method for link- 
ing AWE and neural AES, by extracting Top- 
ical Components (TCs) representing evidence 
from a source text using the intermediate out- 
put of attention layers. We evaluate perfor- 
mance using a feature-based AES requiring 
TCs. Results show that performance is compa- 
rable whether using automatically or manually 
constructed TCs for 1) representing essays as 
rubric-based features, 2) grading essays. 


1 Introduction 


Automated essay scoring (AES) systems reliably 
grade essays at scale, while automated writing eval- 
uation (AWE) systems additionally provide forma- 
tive feedback to guide revision. Although neural 
networks currently generate state-of-the-art AES 
results (Alikaniotis et al., 2016; Taghipour and Ng, 
2016; Dong et al., 2017; Farag et al., 2018; Jin et al., 
2018; Li et al., 2018; Tay et al., 2018; Zhang and 
Litman, 2018), non-neural AES create feature rep- 
resentations more easily useable by AWE (Roscoe 
et al., 2014; Foltz and Rosenstein, 2015; Crossley 
and McNamara, 2016; Woods et al., 2017; Madnani 
et al., 2018; Zhang et al., 2019). We believe that 
neural AES can also provide useful information for 
creating feature representations, e.g., by exploiting 
information in the intermediate layers. 

Our work focuses on a particular source-based 
essay writing task called the response-to-text as- 
sessment (RTA) (Correnti et al., 2013). Recently, 
an RTA AWE system (Zhang et al., 2019) was built 
by extracting rubric-based features related to the 
use of Topical Components (TCs) in an essay. However,
manual expert effort was first required to create
the TCs. For each source, the TCs consist of
a comprehensive list of topics related to evidence 
which include: 1) important words indicating the 
set of evidence topics in the source, and 2) phrases 
representing specific examples for each topic that 
students need to find and use in their essays. 

To eliminate this expert effort, we propose a 
method for using the interpretable output of the 
attention layers of a neural AES for source-based 
essay writing, with the goal of extracting TCs. We 
evaluate this method by using the extracted TCs 
to support feature-based AES for two RTA source 
texts. Our results show that 1) the feature-based
AES performs comparably whether its TCs are created
manually by humans or generated by our neural method,
and 2) the values of the rubric-based essay features
based on automatic TCs are highly correlated with
human Evidence scores.


2 Related Work 


Three recent AWE systems have used non-neural 
AES to provide rubric-specific feedback. Woods 
et al. (2017) developed an influence estimation process
that used a logistic regression AES to identify
sentences needing feedback. Shibani et al. (2019) 
presented a web-based tool that provides formative 
feedback on rhetorical moves in writing. Zhang 
et al. (2019) used features created for a random 
forest AES to select feedback messages, although 
human effort was first needed to create TCs from 
a source text. We automatically extract TCs using 
neural AES, thereby eliminating this expert effort. 

Others have also proposed methods for pre- 
processing source information external to an es- 
say. Content importance models for AES predict 
the parts of a source text that students should in- 
clude when writing a summary (Klebanov et al., 
Source Excerpt: Today, Yala Sub-District Hospital has medicine, free of charge, for all of the most common diseases. Water
is connected to the hospital, which also has a generator for electricity. Bed nets are used in every sleeping site in Sauri...


Essay Prompt: The author provided one specific example of how the quality of life can be improved by the Millennium Villages 
Project in Sauri, Kenya. Based on the article, did the author provide a convincing argument that winning the fight against poverty is 
achievable in our lifetime? Explain why or why not with 3-4 examples from the text to support your answer. 


Essay: In my opinion I think that they will achieve it in lifetime. During the years threw 2004 and 2008 they made progress. 
People didnt have the money to buy the stuff in 2004. The hospital was packed with patients and they didnt have alot of treatment 
in 2004. In 2008 it changed the hospital had medicine, free of charge, and for all the common dieases. Water was connected 
to the hospital and has a generator for electricity. Everybody has net in their site. The hunger crisis has been addressed with 
fertilizer and seeds, as well as the tools needed to maintain the food. The school has no fees and they serve lunch. To me thats 
sounds like it is going achieve it in the lifetime. 


Table 1: A source excerpt for the RTA_MVP prompt and an essay with a score of 3.


Table 2: The Evidence score distribution of RTA. 


2014). Methods for extracting important keywords 
or keyphrases also exist, both supervised (unlike 
our approach) (Meng et al., 2017; Mahata et al., 
2018; Florescu and Jin, 2018) and unsupervised 
(Florescu and Caragea, 2017). Rahimi and Litman 
(2016) developed a TC extraction LDA model (Blei 
et al., 2003). While the LDA model considers all 
words equally, our model takes essay scores into 
account by using attention to represent word impor- 
tance. Both the unsupervised keyword and LDA 
models will serve as baselines in our experiments. 

In computer vision, attention-cropped images
have been used for further image classification
or object detection (Cao et al., 2015; Yuxin
et al., 2018; Ebrahimpour et al., 2019). In NLP,
Lei et al. (2016) proposed using a generator
to find candidate rationales, which are then passed
through an encoder for prediction. Our work is
similar in spirit to this type of work.


3 RTA Corpus and Prior AES Systems 


The essays in our corpus were written by students
in grades 4 to 8 in response to two RTA source
texts (Correnti et al., 2013): RTA_MVP (2970 essays)
and RTA_Space (2076 essays). Table 1 shows
an excerpt from RTA_MVP, the associated essay
writing prompt, and a student essay. The bolding
in the source indicates evidence examples that experts
manually labeled as important for students
to discuss (i.e., TC phrases). Evidence usage in
each essay was manually scored on a scale of 1
to 4 (low to high). The distribution of Evidence
scores is shown in Table 2. The essay in Table 1
received a score of 3, with the bolding indicating
phrases semantically related to the TCs from the
source text.

Prompt  | RTA_MVP     | RTA_Space
Score 1 | [illegible] | [illegible]
Score 2 | 1197 (40%)  | 789 (38%)
Score 3 | 616         | 512
Score 4 | (10%)       | (11%)
Total   | 2970        | 2076


To date, two approaches to AES have been proposed
for the RTA: AES_rubric and AES_neural. To
support the needs of AWE, AES_rubric (Zhang and
Litman, 2017) used a traditional supervised learning
framework where rubric-motivated features
were extracted from every essay before model training:
Number of Pieces of Evidence (NPE)¹, Concentration
(CON), Specificity (SPC)², and Word Count
(WOC). The two aspects of TCs introduced in Section
1 (topic words, specific example phrases) were
used during feature extraction.

Motivated by improving stand-alone AES performance
(i.e., when an interpretable model was not
needed for subsequent AWE), Zhang and Litman
(2018) developed AES_neural, a hierarchical neural
model with a co-attention mechanism at the sentence
level to capture the relationship between the
essay and the source. Neither feature engineering
nor TC creation was needed before training.


4 Attention-Based TC Extraction: TC_attn


In this section we propose a method for extracting
TCs based on the AES_neural attention-level
outputs. Since the self-attention and co-attention
mechanisms were designed to capture sentence
and phrase importance, we hypothesize that the
attention scores can help determine if a sentence or

¹ An integer feature based on the list of topic words for
each topic.

² A vector of integer values indicating the number of specific
example phrases (semantically) mentioned in the essay
per topic.
No. | Sentence | attn_sent | attn_phrase
1 | People didn’t have the money to buy the stuff in 2004. | 0.00420 | 0.23372
2 | The hunger crisis has been addressed with fertilizer and seeds, as well as the tools needed to maintain the food. | 0.08709 | 0.62848
3 | The school has no fees and they serve lunch. | 0.10686 | 0.63369

Table 3: Example attention scores of essay sentences.


phrase has important source-related information.

To provide intuition, Table 3 shows example
sentences from the student essay in Table 1. Bolded
are phrases with the highest self-attention score
within the sentence. Italics are specific example
phrases that refer to the manually constructed TCs
for the source. Here, attn_sent is the text-to-essay
attention score that measures which essay sentences
have the closest meaning to a source sentence,
and attn_phrase is the self-attention score of the bolded
phrase that measures phrase importance. A sentence
with a high attention score tends to include at
least one specific example phrase, and vice versa.
The phrase with the highest attention score tends
to include at least one specific example phrase if
the sentence has a high attention score.

Based on these observations, we first extract the
output of two layers from the neural network: 1)
the attn_sent of each sentence, and 2) the output of
the convolutional layer as the representation of the
phrase with the highest attn_phrase in each sentence
(denoted by cnn_phrase). We also extract the plain
text of the phrase with the highest attn_phrase in
each sentence (denoted by text_phrase). Then, our
TC_attn method uses the extracted information in
3 main steps: 1) filtering out text_phrase from sentences
with low attn_sent, 2) clustering all remaining
text_phrase based on cnn_phrase, and 3) generating
TCs from clusters.

The first filtering step keeps all text_phrase where
the original sentences have attn_sent higher than
a threshold. The intuition is that lower attn_sent
indicates less source-related information.
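
The filtering step can be illustrated with a minimal sketch. The record layout, the function name, the phrase strings, the two-dimensional vectors, and the threshold of 0.05 are all assumptions made for illustration; only the attn_sent magnitudes echo the values in Table 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceRecord:
    """Per-sentence outputs assumed to be read off the neural AES."""
    attn_sent: float         # text-to-essay attention score of the sentence
    text_phrase: str         # plain text of the highest-attention phrase
    cnn_phrase: List[float]  # convolutional representation of that phrase


def filter_by_sentence_attention(records, threshold):
    """Keep only phrases whose sentence has attn_sent above the threshold."""
    return [r for r in records if r.attn_sent > threshold]


# Hypothetical records: phrases and vectors are invented for the example.
records = [
    SentenceRecord(0.00420, "money to buy", [0.1, 0.9]),
    SentenceRecord(0.08709, "fertilizer and seeds", [0.8, 0.2]),
    SentenceRecord(0.10686, "no fees", [0.7, 0.3]),
]

kept = filter_by_sentence_attention(records, threshold=0.05)
print([r.text_phrase for r in kept])
```

In this toy run the phrase from the low-attention sentence is dropped, and the two remaining phrases move on to clustering.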

The second step clusters these text_phrase based
on their corresponding representations cnn_phrase.
We use k-medoids to cluster text_phrase into M
clusters, where M is the number of topics in the
source text. Then, for the text_phrase in each topic
cluster, we use k-medoids to cluster them into N
clusters, where N is the number of specific
example phrases we want to extract from each topic.
The outputs of this step are M × N clusters.
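
The two-level clustering can be sketched as follows. This is a simple alternating k-medoids written for illustration, not necessarily the variant used in our experiments. The Euclidean distance, the random initialization, and all function names are assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_medoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids; returns one cluster label per point."""
    rng = random.Random(seed)
    k = min(k, len(points))
    medoids = rng.sample(range(len(points)), k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: each new medoid minimizes the total distance
        # to the other members of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def two_level_clusters(phrases, vectors, m_topics, n_examples):
    """Cluster phrases into M topics, then each topic into N example clusters."""
    topic_labels = k_medoids(vectors, m_topics)
    clusters = {}
    for t in sorted(set(topic_labels)):
        idx = [i for i, lab in enumerate(topic_labels) if lab == t]
        sub_labels = k_medoids([vectors[i] for i in idx], n_examples)
        for i, e in zip(idx, sub_labels):
            clusters.setdefault((t, e), []).append(phrases[i])
    return clusters


# Hypothetical phrases with toy 2-D vectors standing in for cnn_phrase.
phrases = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
vectors = [(0, 0), (0, 0.1), (1, 0), (1, 0.1),
           (10, 10), (10, 10.1), (11, 10), (11, 10.1)]
clusters = two_level_clusters(phrases, vectors, m_topics=2, n_examples=2)
```

Each key of `clusters` is a (topic, example) pair, so at most M × N clusters are produced. A topic with fewer than N phrases yields fewer example clusters.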

Layer      | Parameter Name        | Value
Embedding  | Embedding dimension   | 50
Word-CNN   | Kernel size           | 5
Word-CNN   | Number of filters     | 100
Sent-LSTM  | Hidden units          | 100
Modeling   | Hidden units          | 100
Dropout    | Dropout rate          | 0.5
Others     | Epochs                | 100
Others     | Batch size            | 100
Others     | Initial learning rate | 0.001
Others     | Momentum              | 0.9

Table 4: Hyper-parameters for neural training.

[Figure: diagram relating the source text, the human expert, the student essays, and AES_neural to the four TC sets (TC_manual, TC_attn, TC_lda, TC_pr), which feed AES_rubric (features and score) and AWE.]

Figure 1: An overview of four TC extraction systems.

The third step uses the topic and example clustering
to extract TCs. As noted earlier, TCs include
two parts: topic words, and specific example
phrases. Since our method is data-driven and students
introduce their vocabulary into the corpus,
essay text is noisy. To make the TC output cleaner,
we filter out words that are not in the source text.
To obtain topic words, we combine all text_phrase
from each topic cluster to calculate the word frequency
per topic. To make topics unique, we assign
each word to the topic cluster in which it has the
highest normalized word frequency. We then include
the top K_topic words based on their frequency
in each topic cluster. To obtain example phrases,
we combine all text_phrase from each example cluster
to calculate the word frequency per example,
then include the top K_example words based on their
frequency in each example cluster.
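
The topic-word generation can be sketched as below. The function name and toy data are hypothetical. "Normalized word frequency" is assumed here to mean a word's count divided by the total count of retained words in its topic cluster, which is one plausible reading.

```python
from collections import Counter


def topic_words(topic_phrases, source_vocab, k_topic):
    """topic_phrases maps a topic id to the list of its text_phrase strings."""
    # Word counts per topic, restricted to words that occur in the source text.
    counts = {
        t: Counter(
            w for p in phrases for w in p.lower().split() if w in source_vocab
        )
        for t, phrases in topic_phrases.items()
    }
    totals = {t: sum(c.values()) or 1 for t, c in counts.items()}

    # Assign each word to the topic where its normalized frequency is highest.
    owner = {}
    for t, c in counts.items():
        for w, n in c.items():
            freq = n / totals[t]
            if w not in owner or freq > owner[w][1]:
                owner[w] = (t, freq)

    # Keep the top k_topic owned words per topic, most frequent first
    # (ties broken alphabetically so the output is deterministic).
    result = {}
    for t, c in counts.items():
        owned = [(w, n) for w, n in c.items() if owner[w][0] == t]
        owned.sort(key=lambda wn: (-wn[1], wn[0]))
        result[t] = [w for w, _ in owned[:k_topic]]
    return result


# Hypothetical clusters; "free" and "stuff" are not in the source vocabulary.
toy_clusters = {
    0: ["hospital medicine", "hospital water"],
    1: ["school lunch", "school water fees", "free stuff"],
}
toy_vocab = {"hospital", "medicine", "water", "school", "lunch", "fees"}
words = topic_words(toy_clusters, toy_vocab, k_topic=3)
```

The same routine, applied to example clusters with K_example and without the uniqueness assignment, would give the example phrases.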


5 Experimental Setup and Results 


Figure 1 shows an overview of the four TC extraction
methods to be evaluated. TC_manual (upper bound)
uses a human expert to extract TCs from a source
text. TC_attn is our proposed method and automatically
extracts TCs using both a source text and
student essays. TC_lda (Rahimi and Litman, 2016)
(baseline) builds on LDA to extract TCs from student
essays only, while TC_pr (baseline) builds on
PositionRank (Florescu and Caragea, 2017) to instead
extract TCs from only the source text.

Since PositionRank is not designed for TC extraction,
we needed to further process its output to
create TC_pr. To extract topic words, we extract
all keywords from the output. Next, we map each
word to a higher-dimensional vector using word embeddings.
Lastly, we cluster all keywords using k-medoids
into PR_topic topics. To extract example phrases,
we put them into only one topic and remove all
redundant example phrases if they are subsets of
other example phrases.

Prompt    | Component       | Parameter         | TC_lda      | TC_pr       | TC_attn
RTA_MVP   | Topic Words     | Number of Topics  | 9           | 19          | 16
RTA_MVP   | Topic Words     | Number of Words   | 30          | 20          | 25
RTA_MVP   | Example Phrases | Number of Topics  | 20          | 1           | 18
RTA_MVP   | Example Phrases | Number of Phrases | 15          | 20          | 15
RTA_Space | Topic Words     | Number of Topics  | [illegible] | [illegible] | 10
RTA_Space | Topic Words     | Number of Words   | 10          | 10          | 20
RTA_Space | Example Phrases | Number of Topics  | 10          | 1           | 9
RTA_Space | Example Phrases | Number of Phrases | 20          | 20          | 20

Table 5: Parameters for different models.

We configure experiments to test two hypotheses:
H1) the AES_rubric model for scoring Evidence
(Zhang and Litman, 2017) will perform comparably
when extracting features using either TC_attn
or TC_manual, and will perform worse when using
TC_lda or TC_pr; H2) the correlation between
the human Evidence score and the feature values
(NPE and sum of SPC features)³ will be comparable
when extracted using TC_attn and TC_manual,
and will be stronger than when using TC_lda and
TC_pr. The experiment for H1 tests the impact of
using our proposed TC extraction method on the
downstream AES_rubric task, while the H2 experiment
examines the impact on the essay representation
itself.

³ These features are extracted based on TCs.

Following Zhang and Litman (2017), we stratify
the essay corpora: 40% for training word embeddings
and extracting TCs, 20% for selecting the best
embedding and parameters, and 40% for testing. We
use the hyper-parameters from Zhang and Litman
(2018) for neural training as shown in Table 4.
Table 5 shows all other parameters selected using the
development set.

Prompt    | TC_manual (1) | TC_lda (2) | TC_pr (3) | TC_attn (4)
RTA_MVP   | 0.643 (2,3)   | 0.614 (3)  | 0.525     | 0.648 (1,2,3)
RTA_Space | 0.609 (3)     | 0.615 (3)  | 0.559     | 0.622 (1,3)

Table 6: The performance (QWK) of AES_rubric using
different TC extraction methods for feature creation.
The numbers in the parentheses show the model
numbers over which the current model performs significantly
better (p < 0.05). The best results between
automated methods in each row are in bold.

Prompt    | Feature   | TC_manual | TC_lda | TC_pr | TC_attn
RTA_MVP   | NPE       | 0.542     | 0.482  | 0.587 | 0.639
RTA_MVP   | SPC (sum) | 0.689     | 0.585  | 0.365 | 0.679
RTA_Space | NPE       | 0.484     | 0.513  | 0.494 | 0.625
RTA_Space | SPC (sum) | 0.601     | 0.574  | 0.533 | 0.598

Table 7: Pearson’s r comparing feature values computed
using each TC extraction method with human
(gold-standard) Evidence essay scores. All correlation
values are significant (p < 0.05). The best results between
automated methods in each row are in bold.

Results for H1. H1 is supported by the results in
Table 6, which compares the Quadratic Weighted
Kappa (QWK) between human and AES_rubric Evidence
scores (values 1-4) when AES_rubric uses
TC_manual versus each of the automatic methods.
TC_attn always yields better performance, and is even
significantly better than TC_manual.

Results for H2. The results in Table 7 support
H2. TC_attn outperforms the two automated baselines,
and for NPE even yields stronger correlations
than the manual TC method.
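
For readers unfamiliar with the two evaluation metrics, the sketch below shows how QWK and Pearson's r are commonly computed. It is a plain-Python illustration, not the evaluation code used for Tables 6 and 7. The function names and toy scores are hypothetical.

```python
import math


def quadratic_weighted_kappa(human, predicted, min_score=1, max_score=4):
    """QWK between two lists of integer scores on [min_score, max_score].

    Assumes the scores are not all in one identical category, since the
    expected-disagreement denominator would then be zero.
    """
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for h, p in zip(human, predicted):
        observed[h - min_score][p - min_score] += 1
    total = len(human)
    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2      # quadratic penalty
            expected = hist_h[i] * hist_p[j] / total  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy Evidence scores (1-4) and toy feature values, invented for the example.
human_scores = [1, 2, 3, 4, 2, 3]
model_scores = [1, 2, 3, 4, 3, 3]
npe_values = [0, 1, 2, 3, 1, 2]

qwk = quadratic_weighted_kappa(human_scores, model_scores)
r = pearson_r(npe_values, human_scores)
```

QWK penalizes a disagreement by the squared distance between the two scores, so confusing a 1 with a 4 costs far more than confusing a 2 with a 3.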

Qualitative Analysis. The manually created
topic words for RTA_MVP represent 4 topics,
which are “hospital”, “malaria”, “farming” and
“school”⁴. Although Table 5 shows that the automated
list has more topics for topic words and
might have broken one topic into separate topics,
a good automated list should have more topics related
to the 4 topics above. We manually assign a
topic for each of the topic words from the different
automated methods. TC_lda has 4 related topics out
of 9 (44.44%), TC_pr has 6 related topics out of 19
(31.58%), and TC_attn has 10 related topics out of
16 (62.50%). TC_attn thus preserves a higher proportion
of related topics than our baselines.

Moving to the second aspect of TCs (specific
example phrases), Table 8 shows the first 10 specific
example phrases for a manually created category
that introduces the changes made by the
MVP project⁵. This category is a mixture of different
topics because it talks about the “hospital”,
“malaria”, “school”, and “farming” at the same time.
TC_attn overlaps with TC_manual on different
topics. However, TC_lda mainly talks about “hospital”,
because the nature of the LDA model doesn’t
allow mixing specific example phrases about different
topics in one category. Unfortunately, TC_pr


⁴ All Topic Words generated by different models can be
found in Appendix A.1.

⁵ All Specific Example Phrases generated by different models
can be found in Appendix A.2.
TC_manual:
progress just four years
medicine most common diseases
water connected hospital
hospital generator electricity
bed nets used every sleeping site
hunger crisis addressed fertilizer seeds
tools needed maintain food supply
no school fees
school attendance rate way up
kids go school now

TC_lda:
running water electricity
patients afford
share beds
recieve treatment
doctors clinical
water fertilizer knowledge
receive treatment
water connected hospital generator electricity
rooms packed patients probably
doctor clinical officer running hospital

TC_pr:
brighter future hannah
millennium villages project
unpaved dirt road
bar sauri primary school
future hannah
sauri primary school
villages project
millennium development goals
village leaders
dirt road

TC_attn:
electricity running water irrigation set
poor showed treatment school supplies
farmers could crops afford bed
electricity hospital
better fertilizer medicine enough also
rooms packed patients
food fertilizer crops get supply
five net costs 5
nets net bed free
running water supplies schools almost

Table 8: Specific example phrases for the RTA_MVP progress topic.


does not include any overlapping specific phrase in
the first 10 items; they all refer to general
example phrases from the beginning of the source
article. Although there are some related specific
example phrases in the full list, they are mainly
about school. This is because the PositionRank
algorithm tends to assign higher scores to words
that appear early in the text.


6 Conclusion and Future Work 


This paper proposes TC_attn, a method for using the
attention scores in a neural AES model to automatically
extract the Topical Components of a source
text. Evaluations show the potential of TC_attn
for eliminating expert effort without degrading
AES_rubric performance or the feature representations
themselves. TC_attn outperforms baselines
and generates comparable or even better results
than a manual approach.

Although TC_attn outperforms all baselines and
requires no human effort on TC extraction, annotation
of essay evidence scores is still needed. This
suggests an interesting direction for future work:
training AES_neural using a gold standard that
can be extracted automatically.

One of our next steps is to investigate the impact
of TC extraction methods on a corresponding
AWE system (Zhang et al., 2019), which uses the
feature values produced by AES_rubric to generate
formative feedback to guide essay revision.

Currently, TC_lda is trained on student essays,
while TC_pr works only on the source
article. However, TC_attn uses both student essays
and the source article for TC generation. It is thus
hard to say whether the superior performance of
TC_attn is due to the neural architecture and attention
scores rather than the richer training resources.
A comparison between TC_attn and a
model that uses both student essays and the source
article is therefore needed.
A Appendices


A.1 Topic Words Results 


Table 9 shows all topic words for RTA_MVP
from TC_manual. Table 10 shows all topic words
for RTA_MVP from TC_lda. Table 11 shows
all topic words for RTA_MVP from TC_pr.
Table 12 shows all topic words for RTA_MVP
from TC_attn.


A.2 Specific Example Phrases Results 


Table 13 shows all specific example phrases for
RTA_MVP from TC_manual. Table 14 shows all
specific example phrases for RTA_MVP from
TC_lda. Table 15 shows all specific example phrases
for RTA_MVP from TC_pr. Table 16 shows all
specific example phrases for RTA_MVP from
TC_attn.
Topic 1 Topic 2 Topic 3 Topic 4


care bed farmer school 
health net fertilizer supplies 
hospital malaria irrigation fee
treatment infect dying student 
doctor bednet crop midday 
electricity mosquito seed meal 
disease bug water lunch 
water sleeping harvest supply 
sick die hungry book 
medicine cheap feed paper 
generator infect food pencil 
no biting energy 
die free 
kid children 
bed kid 
patient go 
clinical attend 
officer 
running 


Table 9: Topic words of TC_manual.
Category 1
brighter future hannah 
millennium villages project 
unpaved dirt road 
bar sauri primary school 
future hannah 
sauri primary school 
villages project 
millennium development goals 
village leaders 
dirt road 
car jump 
little kids 
preventable diseases people 
many kids 
diseases people 
kids die 
school supplies 
primary school 
school fees 
infect people 


Table 15: Specific example phrases of TC_pr.